{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BE97DJ2_2gYM"
   },
   "source": [
    "# Ungraded Lab: Building ML Pipelines with Kubeflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EzUU3ZPtib8K"
   },
   "source": [
    "In this lab, you will have some hands-on practice with [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/overview/pipelines-overview/). As mentioned in the lectures, modern ML engineering is moving towards pipeline automation for rapid iteration and experiment tracking. This is especially useful in production deployments where models need to be frequently retrained to catch trends in newer data.\n",
    "\n",
    "Kubeflow Pipelines is one component of the [Kubeflow](https://www.kubeflow.org/) suite of tools for machine learning workflows. It is deployed on top of a Kubernetes cluster and builds an infrastructure for orchestrating ML pipelines and monitoring inputs and outputs of each component. You will use this tool in Google Cloud Platform in the first assignment this week and this lab will help prepare you for that by exploring its features on a local deployment. In particular, you will:\n",
    "\n",
    "* setup [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/overview/pipelines-overview/) in your local workstation\n",
    "* get familiar with the Kubeflow Pipelines UI\n",
    "* build pipeline components with Python and the Kubeflow Pipelines SDK\n",
    "* run an ML pipeline with Kubeflow Pipelines\n",
    "\n",
    "Let's begin!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uOZgYS16iqAo"
   },
   "source": [
    "## Setup\n",
    "\n",
    "You will need these tool installed in your local machine to complete the exercises:\n",
    "\n",
    "1. Docker - platform for building and running containerized applications. You should already have this installed from the previous ungraded labs. If not, you can see the instructions [here](https://docs.docker.com/get-docker/). If you are using Docker for Desktop (Mac or Windows), you may need to increase the resource limits to start Kubeflow Pipelines later. You can click on the Docker icon in your Task Bar, choose `Preferences` and adjust the CPU to 4, Storage to 50GB, and the memory to at least 4GB (8GB recommended). Just make sure you are not maxing out any of these limits (i.e. the slider should ideally be at the midpoint or less) since it can make your machine slow or unresponsive. If you're constrained on resources, don't worry. You can still use this notebook as reference since we'll show the expected outputs at each step. The important thing is to become familiar with this Kubeflow Pipelines before you get more hands-on in the assignment.\n",
    "\n",
    "2. kubectl - tool for running commands on Kubernetes clusters. This should also be installed from the previous labs. If not, please see the instructions [here](https://kubernetes.io/docs/tasks/tools/)\n",
    "\n",
    "3. [kind](https://kind.sigs.k8s.io/) - a Kubernetes distribution for running local clusters using Docker. Please follow the instructions [here](https://www.kubeflow.org/docs/components/pipelines/installation/localcluster-deployment/#kind) to install kind and create a local cluster. (**NOTE: This lab was tested using `kind v0.20` running `Kubernetes v1.27.2`. You can check the default Kubernetes image used by the `kind` version you are about to download in the \"Breaking Changes\" section [here](https://github.com/kubernetes-sigs/kind/releases). If you want to use the same image, you can use the `--image` flag when creating the cluster (e.g. `kind create cluster --image=kindest/node:v1.27.2`). After creating the cluster, you can check the Kubernetes version with the command `kubectl version`.**)\n",
    "\n",
    "4. Kubeflow Pipelines (KFP) - a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. Once you've created a local cluster using `kind`, you can deploy Kubeflow Pipelines with these commands. (**NOTE: This lab was tested using KFP v1.8.5 with SDK v1.8.22**).\n",
    "\n",
    "```\n",
    "export PIPELINE_VERSION=1.8.5\n",
    "kubectl apply -k \"github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION\"\n",
    "kubectl wait --for condition=established --timeout=300s crd/applications.app.k8s.io\n",
    "kubectl apply -k \"github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION\"\n",
    "```\n",
    "\n",
    "You can  enter the commands above one line at a time. These will setup all the deployments and spin up the pods for the entire application. These will be found in the `kubeflow` namespace. After sending the last command, it will take a moment (around 30 minutes) for all the deployments to be ready. You can send the command `kubectl get deploy -n kubeflow` a few times to check the status. You should see all deployments with the `READY` status before you can proceed to the next section.\n",
    "\n",
    "```\n",
    "NAME                              READY   UP-TO-DATE   AVAILABLE   AGE\n",
    "cache-deployer-deployment         1/1     1            1           21h\n",
    "cache-server                      1/1     1            1           21h\n",
    "metadata-envoy-deployment         1/1     1            1           21h\n",
    "metadata-grpc-deployment          1/1     1            1           21h\n",
    "metadata-writer                   1/1     1            1           21h\n",
    "minio                             1/1     1            1           21h\n",
    "ml-pipeline                       1/1     1            1           21h\n",
    "ml-pipeline-persistenceagent      1/1     1            1           21h\n",
    "ml-pipeline-scheduledworkflow     1/1     1            1           21h\n",
    "ml-pipeline-ui                    1/1     1            1           21h\n",
    "ml-pipeline-viewer-crd            1/1     1            1           21h\n",
    "ml-pipeline-visualizationserver   1/1     1            1           21h\n",
    "mysql                             1/1     1            1           21h\n",
    "workflow-controller               1/1     1            1           21h\n",
    "```\n",
    "\n",
    "When everything is ready, you can run the following command to access the `ml-pipeline-ui` service.\n",
    "\n",
    "```\n",
    "kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80\n",
    "```\n",
    "\n",
    "The terminal should respond with something like this:\n",
    "\n",
    "```\n",
    "Forwarding from 127.0.0.1:8080 -> 3000\n",
    "Forwarding from [::1]:8080 -> 3000\n",
    "```\n",
    "\n",
    "You can then open your browser and go to `http://localhost:8080` to see the user interface.\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/kfp_ui.png?raw=1\" alt=\"kfp ui\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LbEdKUHBvLdi"
   },
   "source": [
    "## Operationalizing your ML Pipelines\n",
    "\n",
    "As you know, generating a trained model involves executing a sequence of steps. Here is a high level overview of what these steps might look like:\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/highlevel.jpg?raw=1\" alt=\"highlevel.jpg\">\n",
    "\n",
    "You can recall the very first model you ever built and more likely than not, your code then also followed a similar flow. In essence, building an ML pipeline mainly involves implementing these steps but you will need to optimize your operations to deliver value to your team. Platforms such as Kubeflow helps you to build ML pipelines that can be automated, reproducible, and easily monitored. You will see these as you build your pipeline in the next sections below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pWrq6Ean7ZVE"
   },
   "source": [
    "### Pipeline components\n",
    "\n",
    "The main building blocks of your ML pipeline are referred to as [components](https://www.kubeflow.org/docs/components/pipelines/overview/concepts/component/). In the context of Kubeflow, these are containerized applications that run a specific task in the pipeline. Moreover, these components generate and consume *artifacts* from other components. For example, a download task will generate a dataset artifact and this will be consumed by a data splitting task. If you go back to the simple pipeline image above and describe it using tasks and artifacts, it will look something like this:\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/simple_dag.jpg?raw=1\" alt=\"img/simple_dag.jpg\">\n",
    "\n",
    "This relationship between tasks and their artifacts are what constitutes a pipeline and is also called a [directed acyclic graph (DAG)](https://en.wikipedia.org/wiki/Directed_acyclic_graph).\n",
    "\n",
    "Kubeflow Pipelines let's you create components either by [building the component specification directly](https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/#component-spec) or through [Python functions](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/). For this lab, you will use the latter since it is more intuitive and allows for quick iteration. As you gain more experience, you can explore building the component specification directly especially if you want to use different languages other than Python.\n",
    "\n",
    "You will begin by installing the Kubeflow Pipelines SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4IvRt6wC2n8Y"
   },
   "outputs": [],
   "source": [
    "# Install the KFP SDK\n",
    "!pip install --upgrade kfp==1.8.22"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DmZeOyVu8MyJ"
   },
   "source": [
    "Now you will import the modules you will be using to construct the Kubeflow pipeline. You will know more what these are for in the next sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cSt2DEJA2ttR"
   },
   "outputs": [],
   "source": [
    "# Import the modules you will use\n",
    "import kfp\n",
    "\n",
    "# For creating the pipeline\n",
    "from kfp.v2 import dsl\n",
    "\n",
    "# For building components\n",
    "from kfp.v2.dsl import component\n",
    "\n",
    "# Type annotations for the component artifacts\n",
    "from kfp.v2.dsl import (\n",
    "    Input,\n",
    "    Output,\n",
    "    Artifact,\n",
    "    Dataset,\n",
    "    Model,\n",
    "    Metrics\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MV8AZsyW8ahR"
   },
   "source": [
    "In this lab, you will build a pipeline to train a multi-output model trained on the [Energy Effeciency dataset from the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency). It uses the bulding features (e.g. wall area, roof area) as inputs and has two outputs: Cooling Load and Heating Load. You will follow the five-task graph above with some slight differences in the generated artifacts.\n",
    "\n",
    "You will now build the component to load your data into the pipeline. The code is shown below and we will discuss the syntax in more detail after running it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gT4SZtZM22Gc"
   },
   "outputs": [],
   "source": [
    "@component(\n",
    "    packages_to_install=[\"pandas\", \"openpyxl\"],\n",
    "    output_component_file=\"download_data_component.yaml\"\n",
    ")\n",
    "def download_data(url:str, output_csv:Output[Dataset]):\n",
    "    import pandas as pd\n",
    "\n",
    "    # Use pandas excel reader\n",
    "    df = pd.read_excel(url)\n",
    "    df = df.sample(frac=1).reset_index(drop=True)\n",
    "    df.to_csv(output_csv.path, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UAa5GSbDaJpd"
   },
   "source": [
    "When building a component, it's good to determine first its inputs and outputs.\n",
    "\n",
    "* The dataset you want to download is an Excel file hosted by UCI [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx) and you can load that using Pandas. Instead of hardcoding the URL in your code, you can design your function to accept an *input* string parameter so you can use other URLs in case the data has been transferred.\n",
    "\n",
    "* For the *output*, you will want to pass the downloaded dataset to the next task (i.e. data splitting). You should assign this as an `Output` type and specify what kind of artifact it is. Kubeflow provides [several of these](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/components/types/artifact_types.py) such as `Dataset`, `Model`, `Metrics`, etc. All artifacts are saved by Kubeflow to a storage server. For local deployments, the default will be a [MinIO](https://min.io/) server. The [path](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/components/types/artifact_types.py#L51) property fetches the location where this artifact will be saved and that's what you did above when you called `df.to_csv(output_csv.path, index=False)`\n",
    "\n",
    "The inputs and outputs are declared as parameters in the function definition. As you can see in the code we defined a `url` parameter with a `str` type and an `output_csv` parameter with an `Output[Dataset]` type.\n",
    "\n",
    "Lastly, you'll need to use the `component` decorator to specify that this is a Kubeflow Pipeline component. The [documentation](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/components/component_decorator.py#L23) shows several parameters you can set and two of them are used in the code above. As the name suggests, the `packages_to_install` argument declares any extra packages outside the base image that is needed to run your code. As of writing, the default base image is `python:3.7` so you'll need `pandas` and `openpyxl` to load the Excel file.\n",
    "\n",
    "The `output_component_file` is an output file that contains the specification for your newly built component. You should see it in the Colab file explorer once you've ran the cell above. You'll see your code there and other settings that pertain to your component. You can use this file when building other pipelines if necessary. You don't have to redo your code again in a notebook in your next project as long as you have this YAML file. You can also pass this to your team members or use it in another machine. Kubeflow also hosts other reusable modules in their repo [here](https://github.com/kubeflow/pipelines/tree/master/components). For example, if you want a file downloader component in one of your projects, you can load the component from that repo using the [load_component_from_url](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.ComponentStore.load_component_from_url) function as shown below. The [YAML file](https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component-sdk-v2.yaml) of that component should tell you the inputs and outputs so you can use it accordingly.\n",
    "\n",
    "```\n",
    "web_downloader_op = kfp.components.load_component_from_url(\n",
    "    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component-sdk-v2.yaml')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8sNacAzvh6Ei"
   },
   "source": [
    "Next, you will build the next component in the pipeline. Like in the previous step, you should design it first with inputs and outputs in mind. You know that the input of this component will come from the artifact generated by the `download_data()` function above. To declare input artifacts, you can annotate your parameter with the `Input[Dataset]` data type as shown below. For the outputs, you want to have two: train and test datasets. You can see the implementation below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "zpItc-Ob6pnO"
   },
   "outputs": [],
   "source": [
    "@component(\n",
    "    packages_to_install=[\"pandas\", \"scikit-learn\"],\n",
    "    output_component_file=\"split_data_component.yaml\"\n",
    ")\n",
    "def split_data(input_csv: Input[Dataset], train_csv: Output[Dataset], test_csv: Output[Dataset]):\n",
    "    import pandas as pd\n",
    "    from sklearn.model_selection import train_test_split\n",
    "\n",
    "    df = pd.read_csv(input_csv.path)\n",
    "    train, test = train_test_split(df, test_size=0.2)\n",
    "\n",
    "    train.to_csv(train_csv.path, index=False)\n",
    "    test.to_csv(test_csv.path, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9ZM0MDM4qweD"
   },
   "source": [
    "### Building and Running a Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JTQVk643lDMo"
   },
   "source": [
    "Now that you have at least two components, you can try building a pipeline just to quickly see how it works. The code is shown below. Basically, you just define a function with the sequence of steps then use the `dsl.pipeline` decorator. Notice in the last line (i.e. `split_data_task`) that to get a particular artifact from a previous step, you will need to use the `outputs` dictionary and use the parameter name as the key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wZ-U_xsbLOIH"
   },
   "outputs": [],
   "source": [
    "@dsl.pipeline(\n",
    "    name=\"my-pipeline\",\n",
    ")\n",
    "def my_pipeline(url: str):\n",
    "    download_data_task = download_data(url=url)\n",
    "    split_data_task = split_data(input_csv=download_data_task.outputs['output_csv'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OQZH5d2omdos"
   },
   "source": [
    "To generate your pipeline specification file, you need to compile your pipeline function using the [`Compiler`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.compiler.html#kfp.compiler.Compiler) class as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JKFD7AGgLvHV"
   },
   "outputs": [],
   "source": [
    "kfp.compiler.Compiler(mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE).compile(\n",
    "    pipeline_func=my_pipeline,\n",
    "    package_path='pipeline.yaml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tfB-1JyInB-s"
   },
   "source": [
    "After running the cell, you'll see a `pipeline.yaml` file in the Colab file explorer. Please download that because it will be needed in the next step.\n",
    "\n",
    "You can run a pipeline programmatically or from the UI. For this exercise, you will do it from the UI and you will see how it is done programmatically in the Qwiklabs later this week.\n",
    "\n",
    "Please go back to the Kubeflow Pipelines UI and click `Upload Pipelines` from the `Pipelines` page.\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/upload.png?raw=1\" alt=\"upload.png\" width=\"800\">\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "Next, select `Upload a file` and choose the `pipeline.yaml` you downloaded earlier then click `Create`. This will open a screen showing your simple DAG (just two tasks).\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/dag_kfp.png?raw=1\" alt=\"dag_kfp.png\" width=\"640\">\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "Click `Create Run` then scroll to the bottom to input the URL of the Excel file: https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx . Then Click `Start`.\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/url.png?raw=1\" alt=\"url.png\" width=\"640\">\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "Select the topmost entry in the `Runs` page and you should see the progress of your run. You can click on the `download-data` box to see more details about that particular task (i.e. the URL input and the container logs). After it turns green, you should also see the output artifact and you can download it if you want by clicking the minio link.\n",
    "\n",
    "<img src=\"https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/progress.png?raw=1\" alt=\"progress.png\" width=\"800\">\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "Eventually, both tasks will turn green indicating that the run completed successfully. Nicely done!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9eBSFSmuq-l7"
   },
   "source": [
    "### Generate the rest of the components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yQGXOPvms2sW"
   },
   "source": [
    "Now that you've seen a sample workflow, you can build the rest of the components for preprocessing, model training, and model evaluation. The functions will be longer because the task is more complex. Nonetheless, it follows the same principles as before such as declaring inputs and outputs, and specifying the additional packages.\n",
    "\n",
    "In the `eval_model()` function, you'll notice the use of the [`log_metric()`](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/components/types/artifact_types.py#L123) to record the results. You'll see this in the `Visualizations` tab of that task after it has completed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sF6gLo0w6nA4"
   },
   "outputs": [],
   "source": [
    "@component(\n",
    "    packages_to_install=[\"pandas\", \"numpy\"],\n",
    "    output_component_file=\"preprocess_data_component.yaml\"\n",
    ")\n",
    "def preprocess_data(input_train_csv: Input[Dataset], input_test_csv: Input[Dataset],\n",
    "                    output_train_x: Output[Dataset], output_test_x: Output[Dataset],\n",
    "                    output_train_y: Output[Artifact], output_test_y: Output[Artifact]):\n",
    "\n",
    "    import pandas as pd\n",
    "    import numpy as np\n",
    "    import pickle\n",
    "\n",
    "    def format_output(data):\n",
    "        y1 = data.pop('Y1')\n",
    "        y1 = np.array(y1)\n",
    "        y2 = data.pop('Y2')\n",
    "        y2 = np.array(y2)\n",
    "        return y1, y2\n",
    "\n",
    "    def norm(x, train_stats):\n",
    "        return (x - train_stats['mean']) / train_stats['std']\n",
    "\n",
    "    train = pd.read_csv(input_train_csv.path)\n",
    "    test = pd.read_csv(input_test_csv.path)\n",
    "\n",
    "    train_stats = train.describe()\n",
    "\n",
    "    # Get Y1 and Y2 as the 2 outputs and format them as np arrays\n",
    "    train_stats.pop('Y1')\n",
    "    train_stats.pop('Y2')\n",
    "    train_stats = train_stats.transpose()\n",
    "\n",
    "    train_Y = format_output(train)\n",
    "    with open(output_train_y.path, \"wb\") as file:\n",
    "      pickle.dump(train_Y, file)\n",
    "\n",
    "    test_Y = format_output(test)\n",
    "    with open(output_test_y.path, \"wb\") as file:\n",
    "      pickle.dump(test_Y, file)\n",
    "\n",
    "    # Normalize the training and test data\n",
    "    norm_train_X = norm(train, train_stats)\n",
    "    norm_test_X = norm(test, train_stats)\n",
    "\n",
    "    norm_train_X.to_csv(output_train_x.path, index=False)\n",
    "    norm_test_X.to_csv(output_test_x.path, index=False)\n",
    "\n",
    "\n",
    "\n",
    "@component(\n",
    "    packages_to_install=[\"tensorflow\", \"pandas\"],\n",
    "    output_component_file=\"train_model_component.yaml\"\n",
    ")\n",
    "def train_model(input_train_x: Input[Dataset], input_train_y: Input[Artifact],\n",
    "                output_model: Output[Model], output_history: Output[Artifact]):\n",
    "    import pandas as pd\n",
    "    import tensorflow as tf\n",
    "    import pickle\n",
    "\n",
    "    from tensorflow.keras.models import Model\n",
    "    from tensorflow.keras.layers import Dense, Input\n",
    "\n",
    "    norm_train_X = pd.read_csv(input_train_x.path)\n",
    "\n",
    "    with open(input_train_y.path, \"rb\") as file:\n",
    "        train_Y = pickle.load(file)\n",
    "\n",
    "    def model_builder(train_X):\n",
    "\n",
    "      # Define model layers.\n",
    "      input_layer = Input(shape=(len(train_X.columns),))\n",
    "      first_dense = Dense(units='128', activation='relu')(input_layer)\n",
    "      second_dense = Dense(units='128', activation='relu')(first_dense)\n",
    "\n",
    "      # Y1 output will be fed directly from the second dense\n",
    "      y1_output = Dense(units='1', name='y1_output')(second_dense)\n",
    "      third_dense = Dense(units='64', activation='relu')(second_dense)\n",
    "\n",
    "      # Y2 output will come via the third dense\n",
    "      y2_output = Dense(units='1', name='y2_output')(third_dense)\n",
    "\n",
    "      # Define the model with the input layer and a list of output layers\n",
    "      model = Model(inputs=input_layer, outputs=[y1_output, y2_output])\n",
    "\n",
    "      print(model.summary())\n",
    "\n",
    "      return model\n",
    "\n",
    "    model = model_builder(norm_train_X)\n",
    "\n",
    "    # Specify the optimizer, and compile the model with loss functions for both outputs\n",
    "    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)\n",
    "    model.compile(optimizer=optimizer,\n",
    "                  loss={'y1_output': 'mse', 'y2_output': 'mse'},\n",
    "                  metrics={'y1_output': tf.keras.metrics.RootMeanSquaredError(),\n",
    "                          'y2_output': tf.keras.metrics.RootMeanSquaredError()})\n",
    "    # Train the model for 500 epochs\n",
    "    history = model.fit(norm_train_X, train_Y, epochs=100, batch_size=10)\n",
    "    model.save(output_model.path)\n",
    "\n",
    "    with open(output_history.path, \"wb\") as file:\n",
    "        train_Y = pickle.dump(history.history, file)\n",
    "\n",
    "\n",
    "\n",
    "@component(\n",
    "    packages_to_install=[\"tensorflow\", \"pandas\"],\n",
    "    output_component_file=\"eval_model_component.yaml\"\n",
    ")\n",
    "def eval_model(input_model: Input[Model], input_history: Input[Artifact],\n",
    "               input_test_x: Input[Dataset], input_test_y: Input[Artifact],\n",
    "               MLPipeline_Metrics: Output[Metrics]):\n",
    "    import pandas as pd\n",
    "    import tensorflow as tf\n",
    "    import pickle\n",
    "\n",
    "    model = tf.keras.models.load_model(input_model.path)\n",
    "\n",
    "    norm_test_X = pd.read_csv(input_test_x.path)\n",
    "\n",
    "    with open(input_test_y.path, \"rb\") as file:\n",
    "        test_Y = pickle.load(file)\n",
    "\n",
    "    # Test the model and print loss and mse for both outputs\n",
    "    loss, Y1_loss, Y2_loss, Y1_rmse, Y2_rmse = model.evaluate(x=norm_test_X, y=test_Y)\n",
    "    print(\"Loss = {}, Y1_loss = {}, Y1_mse = {}, Y2_loss = {}, Y2_mse = {}\".format(loss, Y1_loss, Y1_rmse, Y2_loss, Y2_rmse))\n",
    "\n",
    "    MLPipeline_Metrics.log_metric(\"loss\", loss)\n",
    "    MLPipeline_Metrics.log_metric(\"Y1_loss\", Y1_loss)\n",
    "    MLPipeline_Metrics.log_metric(\"Y2_loss\", Y2_loss)\n",
    "    MLPipeline_Metrics.log_metric(\"Y1_rmse\", Y1_rmse)\n",
    "    MLPipeline_Metrics.log_metric(\"Y2_rmse\", Y2_rmse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JEsO8UYurD1k"
   },
   "source": [
    "### Build and run the complete pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7XqEnO97vIwY"
   },
   "source": [
    "You can then build and run the entire pipeline as you did earlier. It will take around 20 minutes for all the tasks to complete and you can see the `Logs` tab of each task to see how it's going. For instance, you can see there the model training epochs as you normally see in a notebook environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HqD895So2-h2"
   },
   "outputs": [],
   "source": [
    "# Define a pipeline and create a task from a component:\n",
    "@dsl.pipeline(\n",
    "    name=\"my-pipeline\",\n",
    ")\n",
    "def my_pipeline(url: str):\n",
    "\n",
    "    download_data_task = download_data(url=url)\n",
    "\n",
    "    split_data_task = split_data(input_csv=download_data_task.outputs['output_csv'])\n",
    "\n",
    "    preprocess_data_task = preprocess_data(input_train_csv=split_data_task.outputs['train_csv'],\n",
    "                                           input_test_csv=split_data_task.outputs['test_csv'])\n",
    "\n",
    "    train_model_task = train_model(input_train_x=preprocess_data_task.outputs[\"output_train_x\"],\n",
    "                                   input_train_y=preprocess_data_task.outputs[\"output_train_y\"])\n",
    "\n",
    "    eval_model_task = eval_model(input_model=train_model_task.outputs[\"output_model\"],\n",
    "                                 input_history=train_model_task.outputs[\"output_history\"],\n",
    "                                   input_test_x=preprocess_data_task.outputs[\"output_test_x\"],\n",
    "                                   input_test_y=preprocess_data_task.outputs[\"output_test_y\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UNPq9D263A3d"
   },
   "outputs": [],
   "source": [
    "kfp.compiler.Compiler(mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE).compile(\n",
    "    pipeline_func=my_pipeline,\n",
    "    package_path='pipeline.yaml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Kg0-GihkU4HU"
   },
   "source": [
    "After you've uploaded and ran the entire pipeline, you should see all green boxes and the training metrics in the `Visualizations` tab of the `eval-model` task.\n",
    "\n",
    "<img src='https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/blob/main/course4/week3-ungraded-labs/C4_W3_Lab_1_Intro_to_KFP/img/complete_pipeline.png?raw=1' alt=\"./img/complete_pipeline.png\" width=640>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9bs8p5KZGCgI"
   },
   "source": [
    "## Tear Down\n",
    "\n",
    "If you're done experimenting with the software and want to free up resources, you can execute the commands below to delete Kubeflow Pipelines from your system:\n",
    "\n",
    "```\n",
    "export PIPELINE_VERSION=1.8.5\n",
    "kubectl delete -k \"github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION\"\n",
    "kubectl delete -k \"github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION\"\n",
    "```\n",
    "\n",
    "You can delete the cluster for `kind` with the following:\n",
    "```\n",
    "kind delete cluster\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PUFoY2iqIHyW"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "This lab demonstrated how you can use Kubeflow Pipelines to build and orchestrate your ML workflows. Having automated, shareable, and modular pipelines is a very useful feature in production deployments so you and your team can monitor and maintain your system more effectively. In the first Qwiklabs this week, you will use Kubeflow Pipelines as part of Vertex AI. You'll see more features implemented there such as integration with Tensorboard and more output visualizations from each component.\n",
    "\n",
    "Great job and on to the next part of the course!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "C4_W3_Lab_1_Kubeflow_Pipelines.ipynb",
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
